Analysis of Word Frequencies in Spoken Language of Children



Ernst G. Beier - University of Utah
John A. Starkweather - University of California
Don E. Miller - University of Utah



Abstract 

Zipf (1965) states that a statistical relationship has been established
between high frequency, small variety and shortness in length of words, a
relationship which is presumably valid for language in general. Zipf based
his work on the analysis of written language. The present study is concerned
with discovering whether this law also holds for spoken language of children
and if age differences influence this relationship of variety and frequency
of occurrence of words, as well as the frequency of specific word groups
(such as negative words, self-reference, etc.). For this purpose fifteen
12-year-old boys and fifteen 16-year-old boys of average intelligence were
given a small tape recorder to obtain their verbal output. Forty thousand
words of each of the groups were analysed with the Starkweather program on
an IBM 7094 computer. The results are presented in terms of lists of words
used, the ratios of the number of different words spoken to the number of
total words, and the ratio of variety to frequency of occurrence. It was
thought that this study contributes to a better understanding of children's
spoken language and the growth of their available vocabulary.
Analysis of Word Frequencies in Spoken Language of Children
A Model for an Intercultural Study
Ernst G. Beier - University of Utah
John A. Starkweather - University of California
Don E. Miller - University of Utah

Introduction 

This study was concerned with establishing certain base rates in
language usage of children and investigating some of their psychological
significance.

Related Literature 

A number of studies have investigated language behavior in children
with reference to specific variables such as diversity, socio-economic
background, the type-token ratio (TTR) and others. Chotlos (1944)
investigated the effects of age, socio-economic level, IQ and sex on
language behavior. He computed the TTR from 108 Ss selected from 1000
children between eight and 18 years of age and found that diversity as
measured with the TTR increased with age and IQ, but seemed independent
of location and sex. Busemann's (1925) "action quotient" (i.e. verbs
as related to adjectives, nouns and participles) also increased with age
and IQ. Bernstein (1962) reported a language study carefully controlled
for social class differences. He postulated different verbal planning
orientation in different social classes and proposed an "elaborated"
and "restricted" linguistic code which is based on phrase length, word
length and pausing. He found social class differences, with lower
socio-economic class members using longer mean phrase length, less pausing,
but (with IQ equated) no shorter word length. Pringle and Tanner (1958)
recorded conversation of pre-school children in controlled and spontaneous
conditions. They found deprivation in language skill development of
pre-school children in care homes. This conclusion was also supported by
McCarthy (1954), Verplank (1955) and Milner (1951). Minifie, Darley and
Sherman (1963) obtained three different language samples from five- and
eight-year-old children using picture cards as stimuli. They found
relatively low temporal reliability at both age levels, which perhaps is
due to the small sampling of language. Smith (1926), who recorded
spontaneous conversation of 88 children, noted a great variability within
age groups and suggested that factors other than age would be responsible
for the variance. The Lorge-Thorndike word frequency scales (1944) were
based on written language, and so is Fraprie's (1950) scale of most
common words. West (1953) and Rosenzweig and McNeill (1962) pointed out
important differences between the scales. Zipf's work (1935), which gave
us Zipf's laws, supports the view that frequency of occurrence of words is
inversely related to word length as well as the total number of words of a
given sample. Voelker (1942) reported the 1000 most frequent words extracted
from 100,000 words of spoken language of high school seniors, college
freshmen and seniors. This work will complement the proposed research.

A number of investigators have looked into the relationship of oral
and written language. Horowitz and Newman (1964) and Fraisse and Breyton
(1959) found a greater variety of words and specifically more verbs in
oral speech. Moscovici and Humbert (1960), however, found that Zipf's
laws were supported with their small sample of both oral and written
language probes. Strickland (1962) reported that "oral language in children
is more advanced than the language of the books in which they are taught to
read." Drieman (1962) took written and oral texts from a small sample of
psychology students and found in their written language: longer words,
fewer one-syllable words, more different words of one syllable, a more
varied vocabulary. Word frequency studies are also used for diagnostic
purposes. Gleser, Goldine and Gottschalk (1959) found in five-minute word
samples with 90 Ss differences correlated with intelligence and sex.
Correlations between word frequencies and intelligence were also found by
Zipf (1937). Fairbanks (1944) computed TTR for 30 100-word segments of
spoken language from schizophrenics and college freshmen. He found the
TTR significantly lower for the patients, a finding also supported by Mann
(1944). Darley, Sherman and Siegel (1959) reported an interesting new
look at word frequencies from a diagnostic point of view. They had 35 judges
rank 572 frequent nouns, verbs and adjectives in terms of their scale values
on a "level of abstraction," and found that these words could be reliably
scaled. Flesch (1950) built a more complex reliability score based on a
relationship of "definite" to "total" words. Such methods may be useful
to utilize word usage of children to assess the tools they have available
for learning as well as specific deficits such as may result in dropping
out of school.

The new instruments available for measurement of language development,
such as recording devices and computer programs, have been utilized for
various important efforts, but apparently no normative data on word
frequencies of spoken language such as Lorge-Thorndike prepared for written
language are yet available. Zipf (1965) derived his laws on the basis of
written language; we were interested in testing some of his laws on spoken
language of children.

Since Zipf compares the lawfulness of language across different languages,
such as Chinese, Latin, German and English, our model may also serve
inter-cultural comparisons. The use of spoken language as raw material became
practical when the necessary instruments became available. These
instruments included a small pocket recording instrument which could record
4 hours of speech and a vocal analysis computer program developed by
Starkweather (1964) which prints out word frequencies from texts typed on
IBM cards.

In our present study, we recorded samples of spoken language of thirty
boys, all of normal intelligence (90-110) as measured by 3 subtests of the
Wechsler (information, similarities, vocabulary). Fifteen of the boys were
12 years old and in the 6th grade, the other 15 boys were 16 years old and
in the tenth grade. We instructed the boys to use the recorders after
school, and not to give any special speeches or read into the recorder. When
we had 5000 words recorded for each boy, we selected about 2700 words and
typed them on IBM cards. Altogether we had some 80,000 words for processing.
The computer provided us with individual printouts as well as data of
several word lists, such as the positive words (yes, okay), negative words
(no, never, none, etc.), self-reference singular (I, me, mine),
self-reference plural (we, our), other references (you, they, them),
question words (why, who, what, where), the most frequent 1-4 letter words,
and the type token ratio (number of different words over number of total
words).

Here we were primarily concerned with the following questions: Are Zipf's
findings, derived from written language, applicable to spoken language? The
two laws with which we are concerned are (1) that the number of different
words used increases as the frequency of occurrence becomes smaller, and
(2) that the magnitude of words tends to stand in an inverse relationship to
the number of occurrences. In addition, we shall inspect our data to
discover just precisely how our two age groups differ on the various
variables under investigation.
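
As an illustration only, the type token ratio and the word-list rates
described above could be computed along the following lines. This is a
minimal Python sketch and not the Starkweather program; the tokenizing rule,
the function names and the sample sentence are assumptions, and the word
lists contain only the example words named in the text.

```python
import re
from collections import Counter

# Illustrative word lists; only the examples given in the text are included.
WORD_LISTS = {
    "positive": {"yes", "okay"},
    "negative": {"no", "never", "none"},
    "self_singular": {"i", "me", "mine"},
    "self_plural": {"we", "our"},
    "other": {"you", "they", "them"},
    "question": {"why", "who", "what", "where"},
}

def tokenize(text):
    """Lowercase a transcript and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Number of different words divided by number of total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def list_rates(tokens):
    """Share of all tokens that falls into each word list."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in WORD_LISTS}
    return {name: sum(counts[w] for w in words) / total
            for name, words in WORD_LISTS.items()}

# Hypothetical transcript fragment, 14 tokens of which 12 are different.
sample = "no I said no we can go why not you know where we are"
tokens = tokenize(sample)
ttr = type_token_ratio(tokens)
rates = list_rates(tokens)
```

In this toy fragment "no" and "we" each occur twice, so the negative and
plural self-reference rates are both 2/14. With real transcripts the same
functions would be applied to each boy's sample of about 2700 words.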

Results 

Zipf (page 24) quotes Eldridge for samples of American newspaper
English totaling 43,989 words representing 6002 different words. We shall
compare his sample with our sample of spoken language of 6th and 10th grades.

Table 1

Comparison of Frequency of Occurrence with Number of Different Words Used

                Eldridge       Beier 6th Grade   Beier 10th Grade
Freq. of        No. of         No. of            No. of
Occurrences     Diff. Words    Diff. Words       Diff. Words

Inspecting Table 1 we can see that our samples behave very similarly
to the Eldridge sample quoted by Zipf. Clearly, the number of different words
increases as the frequency of occurrence decreases. We also note some
interesting differences. The spoken language of these children encompasses
only about one half of the total different words used in the newspaper
sample. This great paucity of language may be due to age, or it may be due
to the fact that we are measuring spoken language. We note with particular
interest that the number of different words which occur only once in the
spoken language sample is disproportionately smaller than such single words
used in the newspaper sample. Apparently our sample did cut down on variety
of expression. As the older grade has a slight increase in the number of
different words used, we reason that at least part of this variety may be
due to age. This does not exclude the possibility that written language as
such enhances differentiation of expression over spoken language.
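
The tabulation behind this comparison, the number of different words found
at each frequency of occurrence, can be sketched as follows. This is an
illustrative Python fragment under our own assumptions (the function name
and the toy sentence are hypothetical), not the procedure used by Eldridge
or by the Starkweather program.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency of occurrence to the number of different
    words that occur exactly that often."""
    word_counts = Counter(tokens)          # word -> times it occurs
    return Counter(word_counts.values())   # frequency -> number of words

# Hypothetical toy sample.
tokens = "the cat and the dog and the bird saw a cat".split()
spectrum = frequency_spectrum(tokens)

# Words occurring only once, as a share of all different words.
once_share = spectrum[1] / sum(spectrum.values())
```

Here four words occur once, two occur twice and one occurs three times, so
the number of different words rises as the frequency of occurrence falls,
which is the pattern of the first law.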

The magnitude of words in Eldridge's data appears to rest on an estimate
of the word size. His estimate is based on an average number of phonemes
(individual sounds). We, however, shall use a somewhat simpler measure
(letters) which would not be exactly equivalent but would serve our purpose
of presenting an estimate about the magnitude of words in spoken language.

In Table 2 we shall summarize the Eldridge data and present the data for
our sample.

Table 2

Comparison of magnitude of words and frequency of occurrence

Frequency of    Eldridge Newspaper   Beier 6th grade   Beier 10th grade
Occurrence      Magnitude            Magnitude         Magnitude
                (phonemes)           (letters)         (letters)

1-4             6                    5.4               5.5
5-10            5                    4.7               4.3
15-20           4                    4.1               4.6
21-30           3.5                  3.7               4.2
30-50           3.9                  3.4               4.0
51-60           3.3                  3.3               3.9
61+             2.7                  2.5               2.7


While we used letters rather than phonemes to estimate the magnitude
of the words, we obtained again a somewhat similar grading for children's
spoken language as compared with newspaper English. A decrease in
magnitude as related to an increase of frequency is certainly observable.
It is interesting to note that in our approximation we also discovered that
age influences the magnitude; the 6th grade has a somewhat broader
distribution of shorter words as related to frequency of occurrence. It
should be noted, however, that the total number of short words used remains
relatively alike in both samples.
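
A letter-based estimate of magnitude by frequency band, of the kind
summarized in Table 2, might be computed as in the sketch below. The band
limits, the function name and the toy sample are assumptions chosen for
illustration, and the average is taken over different words rather than
over running words, which may differ from the way the table was prepared.

```python
from collections import Counter

def mean_length_by_band(tokens, bands):
    """Average number of letters of the different words whose frequency
    of occurrence falls in each (low, high) band; high=None is open-ended."""
    counts = Counter(tokens)
    result = {}
    for low, high in bands:
        words = [w for w, c in counts.items()
                 if c >= low and (high is None or c <= high)]
        if words:  # skip bands that contain no words
            result[(low, high)] = sum(len(w) for w in words) / len(words)
    return result

# Hypothetical toy sample: short words are frequent, long words are rare.
tokens = ("a " * 5 + "to " * 4 + "the " * 3 + "recorder language").split()
bands = [(1, 2), (3, 4), (5, None)]
means = mean_length_by_band(tokens, bands)
```

In the toy sample the words used once or twice average 8 letters, those
used three or four times average 2.5 letters, and the word used five times
has a single letter, the same inverse grading seen in the table.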

Table 3

Rate of short word usage by grade

                        6th grade            10th grade
(Most frequent)         N         Rate       N         Rate

1 letter words          2373      0.055      2664      0.061
2 letter words          8544      0.199      8605      0.198
3 letter words          10140     0.236      9975      0.230
4 letter words          6528      0.152      6859      0.158
Total word sample       42924                43406

It appears that both grades use short words relatively evenly, with
a slightly higher use of 1 letter words in the higher grade. We shall
now inspect, in answer to our second group of questions, the frequencies
of our word lists.



Table 4

A comparison of word lists by grade

                     All                 6th Grade           10th Grade
Lists                Words     Rate      Words     Rate      Words     Rate

Positive             1326      0.015     513       0.012     813       0.019
Negative             2442      0.028     931       0.022     1511      0.035
Singular, self       3950      0.046     1712      0.040     2238      0.052
Plural, self         1311      0.015     822       0.019     489       0.011
Others               6716      0.078     2962      0.069     3754      0.086
Question             1311      0.015     522       0.012     789       0.018
Total Words          86,329              42,924              43,406
Total Different
Words                4,567               3,096               3,121
TTR                  0.053               0.072               0.072



In this table we discover that the tenth grade uses positive words,
negative words, singular self-reference, reference to others, and
question words more frequently than the sixth grade. The sixth grade uses
only plural self-reference more frequently. We prepared a correlation matrix
of these frequencies which is presented in Table 5.
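
The rates and type token ratios of Table 4 follow directly from the counts,
as the short Python check below illustrates (the variable names are ours).
It also shows why the pooled TTR of 0.053 is lower than the 0.072 of either
grade: the two grades share much of their vocabulary, so doubling the number
of total words adds comparatively few different words. The figure for shared
words assumes that the "All" count is the union of the two grade
vocabularies, which the source does not state explicitly.

```python
# Counts copied from Table 4.
total_words = {"all": 86329, "6th": 42924, "10th": 43406}
different_words = {"all": 4567, "6th": 3096, "10th": 3121}
negative_words = {"6th": 931, "10th": 1511}

def rate(count, total):
    """Proportion rounded to three decimals, as printed in the table."""
    return round(count / total, 3)

ttr = {g: rate(different_words[g], total_words[g]) for g in total_words}
negative_rate = {g: rate(negative_words[g], total_words[g])
                 for g in negative_words}

# Different words used by both grades, ASSUMING "all" is the union of
# the two grade vocabularies.
shared = different_words["6th"] + different_words["10th"] - different_words["all"]
```

Because the TTR falls as a sample grows, ratios are best compared only
between samples of similar length, as is the case for the two grades here.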



Table 5

Correlation Matrix of Children's Word Frequencies as Related to Grade

                          Grade    WPM 1 Syl.   Pos.   Neg. S.R.S. S.R.P.   Oth.    Qu.     TT

Words per Minute             59
1 Syllable Words            -15      4
Positive                     41     17    -11
Negative                     70     44    -14     42
Self-Reference Sing.         36     17      4     15     65
Self-Reference Plural       -35    -23     42    -33    -52    -46
Other                        59     30    -30     36     50     34    -67
Question Words               38      0    -29     58     48     44    -40
Type Token                    4      0    -39     -9     -5      0    -35     23      3

.296 = .05 level of significance
.349 = .025
.409 = .01
.449 = .005













When we interpret the coefficients which are significant at the 0.1
level of significance, we find that the older boys:

1. Speak faster than the younger boys.

2. Use significantly more "positive" words.

3. Use significantly more "negative" words.

4. Use slightly more "singular" self-reference.

5. Use slightly less "plural" self-reference.

6. Use more "other" references.

7. Use slightly more "question" words.

It is interesting to note that with intelligence equated, the type
token ratio does not differentiate the boys.
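
The entries of Table 5 appear to be product-moment correlations printed
without the decimal point, so that the 70 between grade and negative words
would read r = .70. How one such coefficient is obtained can be sketched in
Python as follows; the per-boy figures below are hypothetical and serve only
to show the calculation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# HYPOTHETICAL data: grade of six boys and each boy's rate of negative words.
grade = [6, 6, 6, 10, 10, 10]
negative_rate = [0.020, 0.024, 0.022, 0.033, 0.036, 0.030]
r = pearson_r(grade, negative_rate)
```

A positive r means that the variable is more frequent in the tenth grade,
a negative r that it is more frequent in the sixth grade, which is how the
list above was read from the first column of the matrix.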

We prepared a factor analysis from these data (including some variables
not discussed here) which yielded the following factors:

FACTORS

(including variables loading .30 or more)

1. One syllable one letter words 85, 3 letter one syllable words -49,
negative 61, self-reference singular 91, self-reference plural -48, question
30, [illegible] -59, I 91, we -52, you 36, [illegible] -60, what 44.

This factor seems to describe people who are self oriented and use a lot
of "I" or self-reference singular words. They don't refer to "us" or "we"
but to "you," and they use more negative words. They also ask more
questions. This could be called an egocentric factor.

2. Two letter one syllable words 38, total one syllable words 39, plural
self-reference 59, other reference -42, words used once -87, type token -89,
we 57, information -43.

This factor seems to describe a less bright group (they have less
information on hand). They refer more to "we" and "us," but not to others.
They use fewer words only once and have a lower type token ratio. This could
be a "closed group" factor describing close group oriented people who don't
have as much access to a variety of language usages.

3. Grade -[?]4, achievement -54, IQ -73, positive -39, negative -53, other
reference -46, age -91, you 30, information 34.

This factor seems to describe a less bright, younger, 6th grade group
who use fewer negative and positive words and fewer references to others
(including "you"). However, they have a higher information score.




4. Positive 67, self-reference plural -32, other reference 56, question 80,
total words 48, [illegible] -60, you 77, [illegible] -33, [illegible] 41,
a 38, information 38.

This factor seems to describe an "other oriented" group who use more
positive words, use more question words, use shorter words, and use "you"
much more often.

5. WPM 32, 2 letter 1 syllable words -76, 3 letter one syllable words 66,
positive -34, the 33, it -32, is -66.

This factor seems to describe a group of somewhat faster talkers who
use fewer positive words, less "is" and "it," and more "the." They use fewer
2 letter one syllable words and more 3 letter one syllable words. Not too
meaningful a factor.

6. IQ 33, 1 letter single syllable 40, a 87, information 45.

This factor describes a brighter group who use "a" more and use more 1
letter single syllable words.

7. Achievement -64, IQ -46, modified achievement -82, vocabulary -72. 

This factor seems to describe a group of low achievers and low IQ 
(less bright) people. 

8. Total one syllable words 76, vocabulary -46.

People who use a lot of one syllable words get lower WAIS vocabulary
scores.

9. IQ 60, 4 letter one syllable words -79, self-reference plural 30, other
reference -31, the 31, it -37, that -59, similarities 75.

This factor seems to describe a brighter group who use fewer one
syllable 4 letter words, use more self-reference, fewer references to
others, "the" more, and "that" less.

While these factors shall not be interpreted at this point, they
may be of diagnostic significance, particularly if these relations are
maintained in future studies. Finally, we shall present a listing of the
most frequent words used by our 2 age groups.

Table 6

Most Frequent Words Used by 6th & 10th Grades

6th Grade    N         Rate (%)      10th Grade    N         Rate (%)

and          2099      4.89          I             1826      4.20
the          1436      3.23          you           1592      3.66
I            1356      3.15          and           1119      2.80
it           1190      2.77          the           1102      2.53
you          1079      2.51          not           1032      2.37
to           1039      2.42          to            1024      2.35
a            1017      2.31          is            978       2.25
is           797       1.85          that          861       1.98
that         769       1.79          a             837       1.92
we           693       1.61          do            647       .86

Total        42,924                  Total         43,406

We note that only two of these most frequent words are not present in
both samples: the words "it" and "we" make the first ten with the 6th
grades, the words "not" and "do" make the first ten with the 10th grades.

Where does all this data take us? We found with our sampling of some
80,000 words of spoken language of 30 boys of two grades and equated
intelligence, that Zipf's laws seem to be applicable to spoken as well as
to written language. We found that the English language is surprisingly
consistent as spoken by these children. They use about 20% different words
in these 2700 word samples. We found a number of differences in the
spoken language pattern of our two groups, such as the faster speech,
the larger amount of "they's," "you's," "no's," and "yes's" in
the older group and "we's" in the younger group. This could be considered
as base-line data in spoken language of children and such data, once
confirmed, may very well be used in many ways: in assessing individuals'
deficits through psycholinguistic profiles, to help in building
reading material which can be easily understood, to understand the
developmental sequences in language development, to obtain national samples
and last but not least, to compare various cultures with each other in their
psycholinguistic development.

We are presently preparing samples of spoken language of retarded
children, gifted children, school dropouts, and of the parents of our
children. We want to understand the meaning of psycholinguistic indicators
and learn about language development. We also are most interested in
stimulating psycholinguistic studies of children in other cultures to
obtain comparative data. Language, after all, is the basic tool of
communication among men and its usage should reveal significant information
of the culture they live in.